Context:
In the vehicle-1 dataset we are given various features that describe the outline or shape of a vehicle. Each vehicle belongs to one of 3 classes: Van, Car or Bus.
Objective:
To classify each vehicle, using the given features describing its shape, as one of the 3 classes mentioned.
#Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score, roc_curve)
from scipy.stats import zscore
from sklearn import metrics, model_selection
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier
from numpy import interp  # numpy.interp; the scipy.interp alias of this function is deprecated
from itertools import cycle
#load the csv file and make the dataframe
df = pd.read_csv('vehicle-1.csv')
#Getting a glimpse of the head of the dataframe
df.head()
#Renaming some of the attributes by just replacing the "." with "_" to avoid any errors down the line.
df.rename(columns={'pr.axis_aspect_ratio': 'pr_axis_aspect_ratio',
'max.length_aspect_ratio': 'max_length_aspect_ratio',
'pr.axis_rectangularity': 'pr_axis_rectangularity',
'max.length_rectangularity': 'max_length_rectangularity',
'scaled_variance.1': 'scaled_variance_1',
'scaled_radius_of_gyration.1': 'scaled_radius_of_gyration_1',
'skewness_about.1': 'skewness_about_1', 'skewness_about.2': 'skewness_about_2'}, inplace=True)
df.head()
#Shape of the data
print("The dataframe has {} rows and {} columns".format(df.shape[0],df.shape[1]))
#Attribute information
df.info()
Observation: We can see that all attributes are numerical, i.e. either float or integer, except for class, which is of type object. We can convert the class labels to numeric codes as done below.
df = df.replace({'car': 0, 'bus': 1, 'van':2})
df.head()
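The encoding step can also be restricted to the class column, so that feature values are never touched by the replacement. The sketch below illustrates this on a small made-up frame; `toy` and `CLASS_CODES` are hypothetical names introduced only for this example, and the rows are not taken from vehicle-1.csv.

```python
import pandas as pd

# Hypothetical mapping that mirrors the replacement used above.
CLASS_CODES = {'car': 0, 'bus': 1, 'van': 2}

# Made-up rows standing in for the real dataset.
toy = pd.DataFrame({'compactness': [95, 91, 104, 93],
                    'class': ['van', 'car', 'bus', 'car']})

# Map only the target column; labels missing from the mapping become NaN.
encoded = toy['class'].map(CLASS_CODES)
unmapped = toy.loc[encoded.isna(), 'class'].unique()
if len(unmapped) > 0:
    raise ValueError(f"Unmapped class labels: {list(unmapped)}")

toy['class'] = encoded.astype(int)
print(toy)
```

Checking for unmapped labels makes a misspelled or unexpected class fail loudly instead of silently staying as text.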
#Check how many null values each column has
df.isnull().sum()
Observation: We can see that there are missing values present. The largest number of missing values is 6, in 'radius_ratio'. We can either drop the affected rows or impute the missing values. Since dropping rows loses data, we will impute the missing values during pre-processing.
#5 point summary
df.describe().transpose()
Observation: It seems that replacing the missing values with the median is a better option here than dropping them.
Note: We should make a copy of the dataframe before we start any manipulation.
vehicle_df = df.copy()
#Fill missing values with the median
vehicle_df.fillna(vehicle_df.median(), axis=0, inplace=True)
#Check for missing values again
vehicle_df.apply(lambda x: sum(x.isnull()))
Observation: The missing values have been replaced
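As a small illustration of why the median is a reasonable fill value, the sketch below imputes a made-up frame (`toy`, not the real data) in which one column contains an extreme value. The median is barely affected by that value, whereas the mean would be pulled towards it.

```python
import numpy as np
import pandas as pd

# Made-up values; 400.0 plays the role of an extreme observation.
toy = pd.DataFrame({'radius_ratio': [150.0, np.nan, 170.0, 160.0, 400.0],
                    'circularity': [40.0, 45.0, np.nan, 50.0, 42.0]})

medians = toy.median()         # column-wise medians, NaN values are skipped
means = toy.mean()             # shown only for comparison
filled = toy.fillna(medians)   # returns a new frame, toy is left unchanged

missing_after = filled.isnull().sum()
print(pd.DataFrame({'median': medians, 'mean': means}))
print(missing_after)
```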
plt.figure(figsize=(12,6))
sns.boxplot(data=vehicle_df, orient='h', palette='Set2', dodge=False)
Observation: Outliers exist in some of the attributes. These should be dealt with as we go through visualizing each attribute to get a better understanding.
sns.pairplot(vehicle_df, hue='class', diag_kind='kde')
Observation: Broadly, we can see that there are plenty of correlations between attributes. We can also see that most attributes look approximately normally distributed. The tails in the distributions also make the presence of outliers more evident. We need to zoom in to analyze the relationships further.
vehicles_counts = pd.DataFrame(vehicle_df['class'].value_counts()).reset_index()
vehicles_counts.columns = ['Labels', 'class']
vehicles_counts['Labels'] = ['Car', 'Bus', 'Van']
vehicles_counts
#Visualizing using a count plot & a pie chart
fig, (ax1, ax2) = plt.subplots(nrows=1,ncols=2, figsize=(15,7))
sns.countplot(x='class', data=vehicle_df, ax = ax1)
ax1.set_xlabel('Vehicle Type', fontsize=10)
ax1.set_ylabel('Count', fontsize=10)
ax1.set_title('Vehicle Type Distribution')
ax1.set_xticklabels(labels=["Car", 'Bus', 'Van'])
explode = (0, 0.1,0)
ax2.pie(vehicles_counts["class"], explode=explode, labels=["Car", 'Bus', 'Van'], autopct='%1.2f%%',
shadow=True, startangle=70)
ax2.axis('equal')
plt.title("Vehicles Types Percentage")
plt.legend(["Car", 'Bus', 'Van'], loc=3)
plt.show()
Observation: Although pie charts are not the best visual representation of data in most cases, here one shows the split clearly: approximately 51% Cars, 26% Buses and 23% Vans. Cars clearly dominate, with a count roughly double that of Buses or Vans. Also, the Bus and Van counts differ by only 19.
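The percentages shown in a pie chart can also be computed directly, which makes them easier to quote. This is a minimal sketch on made-up labels; the counts are illustrative and are not those of the vehicle data, and `names` is a hypothetical lookup assuming the 0/1/2 encoding used earlier.

```python
import pandas as pd

# Made-up encoded labels (0 = car, 1 = bus, 2 = van), not the real data.
labels = pd.Series([0] * 5 + [1] * 3 + [2] * 2, name='class')
names = {0: 'Car', 1: 'Bus', 2: 'Van'}

counts = labels.value_counts().sort_index()
percent = (labels.value_counts(normalize=True).sort_index() * 100).round(2)

# One row per class, indexed by a readable name.
summary = pd.DataFrame({'count': counts, 'percent': percent}).rename(index=names)
print(summary)
```

Sorting by index before renaming keeps the rows in encoding order, so the labels do not depend on which class happens to be most frequent.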
We will use distribution plot, box plot as well as a distribution plot with target hue to understand the relationships in depth. Also, we will look at the outliers and try to identify the quantity and ranges where they exist.
Compactness
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['compactness'],ax=ax1)
ax1.tick_params(labelsize=15)
ax1.set_xlabel('Compactness', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['compactness'],ax=ax2)
ax2.set_title("Box Plot")
ax2.set_xlabel('Compactness', fontsize=15)
bins = range(20, 200, 20)
ax3 = sns.distplot(vehicle_df.compactness[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.compactness[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.compactness[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Compactness', fontsize=15)
plt.title("Compactness vs Class")
plt.legend()
Observation: The distribution looks normal. There are no outliers present. The bulk of the data lies in the 80-100 range, whereas the full spread is roughly 60-120. From the third plot we can see that the higher the compactness, the more likely the vehicle is a car.
Circularity
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['circularity'],ax=ax1)
ax1.set_xlabel('Circularity', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['circularity'],ax=ax2)
ax2.set_xlabel('Circularity', fontsize=15)
ax2.set_title("Box Plot")
bins = range(10, 100, 10)
ax3 = sns.distplot(vehicle_df.circularity[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.circularity[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.circularity[vehicle_df['class']==2],ax=ax3, color='yellow', kde=False, bins=bins, label="van")
ax3.set_xlabel('Circularity', fontsize=15)
plt.title("Circularity vs Class")
plt.legend()
Observation: The distribution looks approximately normal. The bulk of the data lies in the 40-50 range, whereas the entire spread is approximately 30-60. Also, vehicles with high circularity tend to be cars rather than buses or vans. We can also observe that there are almost no vans with circularity higher than 50.
Distance Circularity
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['distance_circularity'],ax=ax1)
ax1.set_xlabel('Distance Circularity', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['distance_circularity'],ax=ax2)
ax2.set_xlabel('Distance Circularity', fontsize=15)
ax2.set_title("Box Plot")
bins = range(20, 150, 10)
ax3 = sns.distplot(vehicle_df.distance_circularity[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.distance_circularity[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.distance_circularity[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Distance Circularity', fontsize=15)
plt.title("Distance Circularity vs Class Distribution")
plt.legend()
Observation: We can see that the distribution is bimodal (it has 2 peaks) rather than a single bell shape. There is a sharp peak in the 100-110 range, where the vehicle is very likely to be a car. Also, there are no outliers. However, there is slight left skewness, as the boxplot has a longer whisker on the left side.
Radius Ratio
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['radius_ratio'],ax=ax1)
ax1.set_xlabel('Radius Ratio', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['radius_ratio'],ax=ax2)
ax2.set_xlabel('Radius Ratio', fontsize=15)
ax2.set_title("Box Plot")
bins = range(10, 400, 20)
ax3 = sns.distplot(vehicle_df.radius_ratio[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.radius_ratio[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.radius_ratio[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Radius Ratio', fontsize=15)
plt.title("Radius Ratio vs Class Distribution")
plt.legend()
Observation: We can see that some outliers exist, which give the distribution a long right tail (right skewness). Let's have a closer look at the outliers.
#Checking Outliers
outlier_columns = []
Q1 = vehicle_df['radius_ratio'].quantile(0.25) # 1st Quartile
Q3 = vehicle_df['radius_ratio'].quantile(0.75) # 3rd Quartile
IQR = Q3 - Q1 # Interquartile range
LTV_radius_ratio = Q1 - 1.5 * IQR # lower bound
UTV_radius_ratio = Q3 + 1.5 * IQR # upper bound
print('Interquartile range = ', IQR)
print('radius_ratio <',LTV_radius_ratio ,'and >',UTV_radius_ratio, ' are outliers')
print('Number of outliers in radius_ratio column below the lower whisker =', vehicle_df[vehicle_df['radius_ratio'] < (Q1-(1.5*IQR))]['radius_ratio'].count())
print('Number of outliers in radius_ratio column above the upper whisker =', vehicle_df[vehicle_df['radius_ratio'] > (Q3+(1.5*IQR))]['radius_ratio'].count())
# storing column name and upper-lower bound value where outliers are present
outlier_columns.append('radius_ratio')
upperLowerBound_Disct = {'radius_ratio':UTV_radius_ratio}
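The same whisker calculation is needed for several attributes, so it can be wrapped in small helpers. The sketch below is one possible way to do that; `iqr_bounds` and `count_outliers` are hypothetical helper names, and the series is made up rather than taken from the dataset.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) whisker limits Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_outliers(series, k=1.5):
    """Count the values below the lower limit and above the upper limit."""
    low, high = iqr_bounds(series, k)
    return int((series < low).sum()), int((series > high).sum())

# Made-up values with one clearly extreme point.
toy = pd.Series([10, 12, 13, 14, 15, 16, 18, 60], name='toy_feature')

low, high = iqr_bounds(toy)
below, above = count_outliers(toy)
print('bounds:', low, high)
print('outliers below / above:', below, above)
```

With helpers like these, each attribute needs only one call instead of a repeated block of quantile arithmetic.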
Axis Aspect Ratio
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['pr_axis_aspect_ratio'],ax=ax1)
ax1.set_xlabel('Axis Aspect Ratio ', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['pr_axis_aspect_ratio'],ax=ax2)
ax2.set_xlabel('Axis Aspect Ratio ', fontsize=15)
ax2.set_title("Box Plot")
bins = range(20, 200, 10)
ax3 = sns.distplot(vehicle_df.pr_axis_aspect_ratio[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.pr_axis_aspect_ratio[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.pr_axis_aspect_ratio[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Axis Aspect Ratio', fontsize=15)
plt.title("Axis Aspect Ratio vs Class Distribution")
plt.legend()
Observation: There is strong right skewness, as the distribution has a fairly long right tail. The same points show up as outliers in the boxplot.
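Skewness read off a plot can be backed by a number. As a rough sketch on made-up series, pandas' `Series.skew()` returns a positive value for a long right tail, a negative value for a long left tail, and a value near zero for a symmetric sample.

```python
import pandas as pd

# Made-up samples: one with a long right tail, its mirror image, and a symmetric one.
right_tailed = pd.Series([1, 2, 2, 3, 3, 3, 4, 20])
left_tailed = -right_tailed
symmetric = pd.Series([1, 2, 3, 4, 5])

skews = {'right_tailed': right_tailed.skew(),
         'left_tailed': left_tailed.skew(),
         'symmetric': symmetric.skew()}
print(skews)

# In a right-skewed sample the mean is pulled above the median.
print(right_tailed.mean(), right_tailed.median())
```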
#Checking Outliers
Q1 = vehicle_df['pr_axis_aspect_ratio'].quantile(0.25) # 1st Quartile
Q3 = vehicle_df['pr_axis_aspect_ratio'].quantile(0.75) # 3rd Quartile
IQR = Q3 - Q1 # Interquartile range
LTV_pr_axis_aspect_ratio = Q1 - 1.5 * IQR # lower bound
UTV_pr_axis_aspect_ratio = Q3 + 1.5 * IQR # upper bound
print('Interquartile range = ', IQR)
print('pr_axis_aspect_ratio <',LTV_pr_axis_aspect_ratio ,'and >',UTV_pr_axis_aspect_ratio, ' are outliers')
print('Number of outliers in axis_aspect_ratio column below the lower whisker =', vehicle_df[vehicle_df['pr_axis_aspect_ratio'] < (Q1-(1.5*IQR))]['pr_axis_aspect_ratio'].count())
print('Number of outliers in axis_aspect_ratio column above the upper whisker =', vehicle_df[vehicle_df['pr_axis_aspect_ratio'] > (Q3+(1.5*IQR))]['pr_axis_aspect_ratio'].count())
# storing column name and upper-lower bound value where outliers are present
outlier_columns.append('pr_axis_aspect_ratio')
upperLowerBound_Disct['pr_axis_aspect_ratio'] = UTV_pr_axis_aspect_ratio
Max Length Aspect Ratio
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['max_length_aspect_ratio'],ax=ax1)
ax1.set_xlabel('Max Length Aspect Ratio', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['max_length_aspect_ratio'],ax=ax2)
ax2.set_xlabel('Max Length Aspect Ratio', fontsize=15)
ax2.set_title("Box Plot")
bins = range(10, 100, 10)
ax3 = sns.distplot(vehicle_df.max_length_aspect_ratio[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.max_length_aspect_ratio[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.max_length_aspect_ratio[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Max Length Aspect Ratio', fontsize=15)
plt.title("Max Length Aspect Ratio vs Class Distribution")
plt.legend()
Observation: In this case as well, there is large right skewness due to the many outliers stretching beyond the upper whisker. Most cars have a max length aspect ratio between 10-20, whereas buses lie between
#Checking Outliers
Q1 = vehicle_df['max_length_aspect_ratio'].quantile(0.25) # 1st Quartile
Q3 = vehicle_df['max_length_aspect_ratio'].quantile(0.75) # 3rd Quartile
IQR = Q3 - Q1 # Interquartile range
LTV_length_aspect_ratio = Q1 - 1.5 * IQR # lower bound
UTV_length_aspect_ratio = Q3 + 1.5 * IQR # upper bound
print('Interquartile range = ', IQR)
print('length_aspect_ratio <',LTV_length_aspect_ratio ,'and >',UTV_length_aspect_ratio, ' are outliers')
print('Number of outliers in length_aspect_ratio column below the lower whisker =',
    vehicle_df[vehicle_df['max_length_aspect_ratio'] < (Q1-(1.5*IQR))]['max_length_aspect_ratio'].count())
print('Number of outliers in length_aspect_ratio column above the upper whisker =',
    vehicle_df[vehicle_df['max_length_aspect_ratio'] > (Q3+(1.5*IQR))]['max_length_aspect_ratio'].count())
# storing column name and upper-lower bound value where outliers are present
# (only the column name goes into outlier_columns; the bounds go into the dictionary)
outlier_columns.append('max_length_aspect_ratio')
upperLowerBound_Disct['length_aspect_ratio_LTV'] = LTV_length_aspect_ratio
upperLowerBound_Disct['length_aspect_ratio_UTV'] = UTV_length_aspect_ratio
Scatter Ratio
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['scatter_ratio'],ax=ax1)
ax1.set_xlabel('Scatter Ratio', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['scatter_ratio'],ax=ax2)
ax2.set_xlabel('Scatter Ratio', fontsize=15)
ax2.set_title("Box Plot")
bins = range(10, 300, 10)
ax3 = sns.distplot(vehicle_df.scatter_ratio[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.scatter_ratio[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.scatter_ratio[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Scatter Ratio', fontsize=15)
plt.title("Scatter Ratio vs Class Distribution")
plt.legend()
Observation: We can see that there are no outliers in the scatter_ratio column. The distribution plot has two peaks and is right-skewed, with the long tail on the right side.
Elongatedness
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['elongatedness'],ax=ax1)
ax1.set_xlabel('Elongatedness', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['elongatedness'],ax=ax2)
ax2.set_xlabel('Elongatedness', fontsize=15)
ax2.set_title("Box Plot")
bins = range(10, 100, 10)
ax3 = sns.distplot(vehicle_df.elongatedness[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.elongatedness[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.elongatedness[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Elongatedness', fontsize=15)
plt.title("Elongatedness vs Class Distribution")
plt.legend()
Observation: We can see that there are no outliers in the elongatedness column. The distribution plot has two peaks and is slightly right-skewed. For most cars, the elongatedness value lies between 30 and 40.
Axis Rectangularity
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['pr_axis_rectangularity'],ax=ax1)
ax1.set_xlabel('Axis Rectangularity', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['pr_axis_rectangularity'],ax=ax2)
ax2.set_xlabel('Axis Rectangularity', fontsize=15)
ax2.set_title("Box Plot")
bins = range(10, 50, 10)
ax3 = sns.distplot(vehicle_df.pr_axis_rectangularity[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.pr_axis_rectangularity[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.pr_axis_rectangularity[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Axis Rectangularity', fontsize=15)
plt.title("Axis Rectangularity vs Class Distribution")
plt.legend()
Observation: We can see that there are no outliers in the pr_axis_rectangularity column. The distribution plot has two peaks and is right-skewed, with the long tail on the right side.
Max Length Rectangularity
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['max_length_rectangularity'],ax=ax1)
ax1.set_xlabel('Max Length Rectangularity', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['max_length_rectangularity'],ax=ax2)
ax2.set_xlabel('Max Length Rectangularity', fontsize=15)
ax2.set_title("Box Plot")
bins = range(100, 300, 10)
ax3 = sns.distplot(vehicle_df.max_length_rectangularity[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.max_length_rectangularity[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.max_length_rectangularity[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Max Length Rectangularity', fontsize=15)
plt.title("Max Length Rectangularity vs Class Distribution")
plt.legend()
Observation: We can see that there are no outliers in the max_length_rectangularity column. The distribution plot has two peaks and is right-skewed, with the long tail on the right side.
Scaled Variance
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['scaled_variance'],ax=ax1)
ax1.set_xlabel('Scaled Variance', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['scaled_variance'],ax=ax2)
ax2.set_xlabel('Scaled Variance', fontsize=15)
ax2.set_title("Box Plot")
bins = range(100, 500, 10)
ax3 = sns.distplot(vehicle_df.scaled_variance[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.scaled_variance[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.scaled_variance[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Scaled Variance', fontsize=15)
plt.title("Scaled Variance vs Class Distribution")
plt.legend()
Observation: We can see that there are outliers in the scaled_variance column. The distribution plot has two peaks and is right-skewed, with the long tail on the right side. Also, there is a peak around the 220-240 range containing a large number of cars.
#Checking Outliers
Q1 = vehicle_df['scaled_variance'].quantile(0.25) # 1st Quartile
Q3 = vehicle_df['scaled_variance'].quantile(0.75) # 3rd Quartile
IQR = Q3 - Q1 # Interquartile range
LTV_scaled_variance = Q1 - 1.5 * IQR # lower bound
UTV_scaled_variance = Q3 + 1.5 * IQR # upper bound
print('Interquartile range = ', IQR)
print('scaled_variance <',LTV_scaled_variance ,'and >',UTV_scaled_variance, ' are outliers')
print('Number of outliers in scaled_variance column below the lower whisker =',
    vehicle_df[vehicle_df['scaled_variance'] < (Q1-(1.5*IQR))]['scaled_variance'].count())
print('Number of outliers in scaled_variance column above the upper whisker =',
    vehicle_df[vehicle_df['scaled_variance'] > (Q3+(1.5*IQR))]['scaled_variance'].count())
# storing column name and upper-lower bound value where outliers are present
outlier_columns.append('scaled_variance')
upperLowerBound_Disct['scaled_variance'] = UTV_scaled_variance
Scaled Variance_1
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['scaled_variance_1'],ax=ax1)
ax1.set_xlabel('Scaled Variance_1', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['scaled_variance_1'],ax=ax2)
ax2.set_xlabel('Scaled Variance_1', fontsize=15)
ax2.set_title("Box Plot")
bins = range(100, 1500, 10)
ax3 = sns.distplot(vehicle_df.scaled_variance_1[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.scaled_variance_1[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.scaled_variance_1[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Scaled Variance_1', fontsize=15)
plt.title("Scaled Variance_1 vs Class Distribution")
plt.legend()
Observation: We can see that there are outliers in the scaled_variance_1 column. The distribution plot has two peaks and is right-skewed, with the long tail on the right side.
#Checking outliers
Q1 = vehicle_df['scaled_variance_1'].quantile(0.25) # 1st Quartile
Q3 = vehicle_df['scaled_variance_1'].quantile(0.75) # 3rd Quartile
IQR = Q3 - Q1 # Interquartile range
LTV_scaled_variance_1 = Q1 - 1.5 * IQR # lower bound
UTV_scaled_variance_1 = Q3 + 1.5 * IQR # upper bound
print('Interquartile range = ', IQR)
print('scaled_variance_1 <',LTV_scaled_variance_1 ,'and >',UTV_scaled_variance_1, ' are outliers')
print('Number of outliers in scaled_variance_1 column below the lower whisker =',
    vehicle_df[vehicle_df['scaled_variance_1'] < (Q1-(1.5*IQR))]['scaled_variance_1'].count())
print('Number of outliers in scaled_variance_1 column above the upper whisker =',
    vehicle_df[vehicle_df['scaled_variance_1'] > (Q3+(1.5*IQR))]['scaled_variance_1'].count())
# storing column name and upper-lower bound value where outliers are present
outlier_columns.append('scaled_variance_1')
upperLowerBound_Disct['scaled_variance_1'] = UTV_scaled_variance_1
Scaled Radius of Gyration
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['scaled_radius_of_gyration'],ax=ax1)
ax1.set_xlabel('Scaled Radius of Gyration', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['scaled_radius_of_gyration'],ax=ax2)
ax2.set_xlabel('Scaled Radius of Gyration', fontsize=15)
ax2.set_title("Box Plot")
bins = range(100, 300, 10)
ax3 = sns.distplot(vehicle_df.scaled_radius_of_gyration[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.scaled_radius_of_gyration[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.scaled_radius_of_gyration[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Scaled Radius of Gyration', fontsize=15)
plt.title("Scaled Radius of Gyration vs Class Distribution")
plt.legend()
Observation: We can see that there are no outliers in the scaled_radius_of_gyration column. The distribution is right-skewed, with the long tail on the right side.
Scaled Radius of Gyration_1
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['scaled_radius_of_gyration_1'],ax=ax1)
ax1.set_xlabel('Scaled Radius of Gyration_1', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['scaled_radius_of_gyration_1'],ax=ax2)
ax2.set_xlabel('Scaled Radius of Gyration_1', fontsize=15)
ax2.set_title("Box Plot")
bins = range(50, 200, 10)
ax3 = sns.distplot(vehicle_df.scaled_radius_of_gyration_1[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.scaled_radius_of_gyration_1[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.scaled_radius_of_gyration_1[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Scaled Radius of Gyration_1', fontsize=15)
plt.title("Scaled Radius of Gyration_1 vs Class Distribution")
plt.legend()
Observation: We can see that there are outliers in the scaled_radius_of_gyration_1 column. The distribution is right-skewed, with the long tail on the right side.
#Checking for Outliers
Q1 = vehicle_df['scaled_radius_of_gyration_1'].quantile(0.25) # 1st Quartile
Q3 = vehicle_df['scaled_radius_of_gyration_1'].quantile(0.75) # 3rd Quartile
IQR = Q3 - Q1 # Interquartile range
LTV_scaled_radius_of_gyration_1 = Q1 - 1.5 * IQR # lower bound
UTV_scaled_radius_of_gyration_1 = Q3 + 1.5 * IQR # upper bound
print('Interquartile range = ', IQR)
print('scaled_radius_of_gyration_1 <',LTV_scaled_radius_of_gyration_1 ,'and >',UTV_scaled_radius_of_gyration_1, ' are outliers')
print('Number of outliers in scaled_radius_of_gyration_1 column below the lower whisker =',
    vehicle_df[vehicle_df['scaled_radius_of_gyration_1'] < (Q1-(1.5*IQR))]['scaled_radius_of_gyration_1'].count())
print('Number of outliers in scaled_radius_of_gyration_1 column above the upper whisker =',
    vehicle_df[vehicle_df['scaled_radius_of_gyration_1'] > (Q3+(1.5*IQR))]['scaled_radius_of_gyration_1'].count())
# storing column name and upper-lower bound value where outliers are present
outlier_columns.append('scaled_radius_of_gyration_1')
upperLowerBound_Disct['scaled_radius_of_gyration_1'] = UTV_scaled_radius_of_gyration_1
Skewness About
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['skewness_about'],ax=ax1)
ax1.set_xlabel('Skewness About', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['skewness_about'],ax=ax2)
ax2.set_xlabel('Skewness About', fontsize=15)
ax2.set_title("Box Plot")
bins = range(0, 50, 10)
ax3 = sns.distplot(vehicle_df.skewness_about[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.skewness_about[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.skewness_about[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Skewness About', fontsize=15)
plt.title("Skewness About vs Class Distribution")
plt.legend()
Observation: We can see that there are outliers in the skewness_about column. The distribution is right-skewed, with the long tail on the right side.
#Checking for Outliers
Q1 = vehicle_df['skewness_about'].quantile(0.25) # 1st Quartile
Q3 = vehicle_df['skewness_about'].quantile(0.75) # 3rd Quartile
IQR = Q3 - Q1 # Interquartile range
LTV_skewness_about = Q1 - 1.5 * IQR # lower bound
UTV_skewness_about = Q3 + 1.5 * IQR # upper bound
print('Interquartile range = ', IQR)
print('skewness_about <',LTV_skewness_about ,'and >',UTV_skewness_about, ' are outliers')
print('Number of outliers in skewness_about column below the lower whisker =',
    vehicle_df[vehicle_df['skewness_about'] < (Q1-(1.5*IQR))]['skewness_about'].count())
print('Number of outliers in skewness_about column above the upper whisker =',
    vehicle_df[vehicle_df['skewness_about'] > (Q3+(1.5*IQR))]['skewness_about'].count())
# storing column name and upper-lower bound value where outliers are present
outlier_columns.append('skewness_about')
upperLowerBound_Disct['skewness_about'] = UTV_skewness_about
Skewness About_1
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['skewness_about_1'],ax=ax1)
ax1.set_xlabel('Skewness About_1', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['skewness_about_1'],ax=ax2)
ax2.set_xlabel('Skewness About_1', fontsize=15)
ax2.set_title("Box Plot")
bins = range(0, 50, 10)
ax3 = sns.distplot(vehicle_df.skewness_about_1[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.skewness_about_1[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.skewness_about_1[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Skewness About_1', fontsize=15)
plt.title("Skewness About_1 vs Class Distribution")
plt.legend()
Observation: We can see that there are outliers in the skewness_about_1 column. The distribution is right-skewed, with the long tail on the right side.
#Checking Outliers
Q1 = vehicle_df['skewness_about_1'].quantile(0.25) # 1st Quartile
Q3 = vehicle_df['skewness_about_1'].quantile(0.75) # 3rd Quartile
IQR = Q3 - Q1 # Interquartile range
LTV_skewness_about_1 = Q1 - 1.5 * IQR # lower bound
UTV_skewness_about_1 = Q3 + 1.5 * IQR # upper bound
print('Interquartile range = ', IQR)
print('skewness_about_1 <',LTV_skewness_about_1 ,'and >',UTV_skewness_about_1, ' are outliers')
print('Number of outliers in skewness_about_1 column below the lower whisker =',
    vehicle_df[vehicle_df['skewness_about_1'] < (Q1-(1.5*IQR))]['skewness_about_1'].count())
print('Number of outliers in skewness_about_1 column above the upper whisker =',
    vehicle_df[vehicle_df['skewness_about_1'] > (Q3+(1.5*IQR))]['skewness_about_1'].count())
# storing column name and upper-lower bound value where outliers are present
outlier_columns.append('skewness_about_1')
upperLowerBound_Disct['skewness_about_1'] = UTV_skewness_about_1
Skewness About_2
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['skewness_about_2'],ax=ax1)
ax1.set_xlabel('Skewness About_2', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['skewness_about_2'],ax=ax2)
ax2.set_xlabel('Skewness About_2', fontsize=15)
ax2.set_title("Box Plot")
bins = range(100, 300, 10)
ax3 = sns.distplot(vehicle_df.skewness_about_2[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.skewness_about_2[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.skewness_about_2[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Skewness About_2', fontsize=15)
plt.title("Skewness About_2 vs Class Distribution")
plt.legend()
Observation: We can see that there are no outliers in the skewness_about_2 column. The distribution is left-skewed, with the long tail on the left side.
Hollows Ratio
fig, (ax1, ax2, ax3) = plt.subplots(nrows = 1, ncols = 3, figsize = (13, 5))
fig.set_size_inches(20,6)
sns.distplot(vehicle_df['hollows_ratio'],ax=ax1)
ax1.set_xlabel('Hollows Ratio', fontsize=15)
ax1.set_title("Distribution Plot")
sns.boxplot(vehicle_df['hollows_ratio'],ax=ax2)
ax2.set_xlabel('Hollows Ratio', fontsize=15)
ax2.set_title("Box Plot")
bins = range(100, 300, 10)
ax3 = sns.distplot(vehicle_df.hollows_ratio[vehicle_df['class']==0], color='red', kde=False, bins=bins, label='Car')
sns.distplot(vehicle_df.hollows_ratio[vehicle_df['class']==1],ax=ax3, color='blue', kde=False, bins=bins, label="Bus")
sns.distplot(vehicle_df.hollows_ratio[vehicle_df['class']==2],ax=ax3, color='cyan', kde=False, bins=bins, label="van")
ax3.set_xlabel('Hollows Ratio', fontsize=15)
plt.title("Hollows Ratio vs Class Distribution")
plt.legend()
Observation: There are no outliers in the hollows_ratio column, and the distribution is left-skewed: its long tail lies on the left side.
print('These are the columns which have outliers : \n\n',outlier_columns)
print('\n\n',upperLowerBound_Disct)
We can replace the outliers with the respective column medians, because dropping those rows could lose information. As always, before manipulating the dataframe we create a new copy of it.
#Copy new dataframe
vehicle_df_new = vehicle_df.copy()
#Replace outliers with medians
for col_name in vehicle_df_new.columns[:-1]:
    q1 = vehicle_df_new[col_name].quantile(0.25)
    q3 = vehicle_df_new[col_name].quantile(0.75)
    iqr = q3 - q1
    low = q1-1.5*iqr
    high = q3+1.5*iqr
    vehicle_df_new.loc[(vehicle_df_new[col_name] < low) | (vehicle_df_new[col_name] > high), col_name] = vehicle_df_new[col_name].median()
#Visualize if any outliers still exist
plt.figure(figsize=(15,8))
sns.boxplot(data=vehicle_df_new, orient="h", palette="Set2", dodge=False)
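The boxplot is a visual check. Because replacing values shifts the quartiles, a numeric recheck can be a useful complement. The following is a minimal sketch on a made-up toy frame; the helper name `remaining_outliers` and the toy values are assumptions for illustration and not part of the original notebook.

```python
import pandas as pd

def remaining_outliers(df, k=1.5):
    """Per-column count of values outside the IQR fences, recomputed on df."""
    counts = {}
    for col in df.columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        mask = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
        counts[col] = int(mask.sum())
    return counts

# Toy frame (made up): column 'a' has one extreme value, 'b' has none.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0],
                    'b': [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]})
before = remaining_outliers(toy)

# Same replacement rule as the loop above: outliers become the column median.
cleaned = toy.copy()
for col in cleaned.columns:
    q1, q3 = cleaned[col].quantile(0.25), cleaned[col].quantile(0.75)
    iqr = q3 - q1
    mask = (cleaned[col] < q1 - 1.5 * iqr) | (cleaned[col] > q3 + 1.5 * iqr)
    cleaned.loc[mask, col] = cleaned[col].median()

after = remaining_outliers(cleaned)
print(before, after)
```

Applied to `vehicle_df_new.iloc[:, :-1]`, a dictionary of zeros would back up the visual impression from the boxplot.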
We have successfully eliminated all the outliers from our dataframe. Now we can look into the correlation of the attributes.
vehicle_df_new.corr()
# We can capture the attributes with correlation greater than 95%
corr_matrix = vehicle_df_new.corr().abs()
high_corr_var=np.where(corr_matrix>0.95)
high_corr_var=[(corr_matrix.columns[x],corr_matrix.columns[y]) for x,y in zip(*high_corr_var) if x!=y and x<y]
high_corr_var
mask = np.zeros_like(vehicle_df_new.corr(), dtype=bool)  # np.bool was removed from recent NumPy releases
mask[np.triu_indices_from(mask)] = True
plt.figure(figsize=(15,7))
plt.title('Correlation Heatmap', y=1.05, size=19)
sns.heatmap(vehicle_df_new.corr(),vmin=-1, cmap='plasma',annot=True, mask=mask, fmt='.2f')
Observation: Many attributes are highly correlated, for example scatter_ratio with scaled_variance and pr_axis_rectangularity, and circularity with max_length_rectangularity, among others.
We have the option to either drop some of the highly correlated attributes or to go with a dimension reduction technique such as Principal Component Analysis (PCA).
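For reference, the first option (dropping attributes) could look roughly like the sketch below. It is not the route this notebook takes. The helper `drop_highly_correlated`, the 0.95 threshold default, and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.95):
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Toy frame (made up): 'b' is an exact multiple of 'a', 'c' is only weakly related.
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
                    'c': [5.0, 1.0, 4.0, 2.0, 3.0]})
reduced, dropped = drop_highly_correlated(toy)
print(dropped)
print(list(reduced.columns))
```

Which member of a correlated pair to keep is a judgment call. This sketch simply keeps the column that appears first.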
# Dropping class variable
X = vehicle_df_new.drop('class', axis =1)
y = vehicle_df_new['class']
X.shape
y.shape
#Standardization
X_scaled = X.apply(zscore)
plt.figure(figsize=(12,6))
sns.boxplot(data=X_scaled, orient="h", palette="Set2", dodge=False)
We can see the rescaled data: after standardization each attribute has mean μ=0 and standard deviation σ=1.
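As a small worked example of what `zscore` does, the sketch below standardizes a made-up column and compares it with the manual formula (x − mean) / std. The toy values are an assumption chosen so the arithmetic is easy to follow: mean 5, population standard deviation 2.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# Toy column (made up for illustration): mean 5, population std 2.
toy = pd.DataFrame({'x': [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]})

scaled = toy.apply(zscore)

# zscore uses the population standard deviation (ddof=0) by default.
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled['x'].tolist())
print(scaled['x'].mean(), scaled['x'].std(ddof=0))
```

Note that pandas' own `std()` defaults to `ddof=1`, so calling `X_scaled.std()` without arguments would report values slightly above 1.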
#Splitting
X_train,X_test,y_train,y_test = train_test_split(X_scaled, y, test_size=0.30, random_state=7)
print('x train data {}'.format(X_train.shape))
print('y train data {}'.format(y_train.shape))
print('x test data {}'.format(X_test.shape))
print('y test data {}'.format(y_test.shape))
# prepare cross validation and array nomenclature. We reuse the seed from the train/test split so that the
# folds are reproducible. Here we will use 10 splits. shuffle=True is required because recent scikit-learn
# versions raise an error when random_state is set without shuffling.
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
kfold
cv_results = [] # to store cross validation result
model_names = [] # to store each model name
models_results = [] # to store the model final result with accuracy, precision, recall, f1-score and cv_results
target_names = ['car', 'bus', 'van']
#Using GridSearch Cross-Validation to identify the best Hyperparameters
svm_model = SVC(gamma="scale")
params = {'kernel': ['linear', 'rbf'], 'C':[0.01, 0.1, 0.5, 1]}
gridSearch_model = GridSearchCV(svm_model, param_grid=params, cv=5)
gridSearch_model.fit(X_train, y_train)
print("Best Hyper Parameters:\n", gridSearch_model.best_params_)
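Instead of copying the reported parameters into a new `SVC` by hand, the fitted search object can be used directly. The sketch below runs on synthetic data from `make_classification`, which stands in for the vehicle features (an assumption for illustration). The names `X_demo`, `y_demo`, `search` and `best_svm` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the vehicle features (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.30, random_state=7)

search = GridSearchCV(SVC(gamma="scale"),
                      param_grid={'kernel': ['linear', 'rbf'], 'C': [0.01, 0.1, 0.5, 1]},
                      cv=5)
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is already refitted on all of X_tr.
best_svm = search.best_estimator_
test_accuracy = best_svm.score(X_te, y_te)
print(search.best_params_, round(test_accuracy, 3))
```

This keeps the evaluated model in sync with the search, including the `gamma` setting the search was run with.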
rawData_svm = SVC(C = 1, kernel = 'rbf', gamma= "auto")
rawData_svm.fit(X_train, y_train)
# Predicting for test set
rawData_svm_y_predicted = rawData_svm.predict(X_test)
rawData_svm_Score = rawData_svm.score(X_test, y_test)
rawData_svm_Accuracy = accuracy_score(y_test, rawData_svm_y_predicted)
rawData_svm_classification_Report = metrics.classification_report(y_test, rawData_svm_y_predicted, target_names=target_names)
rawData_svm_classification_Report_dict = metrics.classification_report(y_test, rawData_svm_y_predicted, target_names=target_names, output_dict=True)
rawData_svm_precision_weightedAvg = rawData_svm_classification_Report_dict['weighted avg']['precision']
rawData_svm_recall_weightedAvg = rawData_svm_classification_Report_dict['weighted avg']['recall']
rawData_svm_f1_score_weightedAvg = rawData_svm_classification_Report_dict['weighted avg']['f1-score']
rawData_svm_confusion_matrix = metrics.confusion_matrix(y_test, rawData_svm_y_predicted)
print('\nRawData SVM: \n', rawData_svm_confusion_matrix)
print('\nRawData SVM classification Report : \n', rawData_svm_classification_Report)
# Cross Validation
rawData_svm_cross_validation_result = model_selection.cross_val_score(rawData_svm, X_scaled, y, cv=kfold, scoring='accuracy')
cv_results.append(rawData_svm_cross_validation_result)
model_names.append('RawData-SVM')
rawData_svm_model_results = pd.DataFrame([['RawData SVM (RBF)', 'No', rawData_svm_Accuracy, rawData_svm_precision_weightedAvg, rawData_svm_recall_weightedAvg,
rawData_svm_f1_score_weightedAvg, rawData_svm_cross_validation_result.mean(),
rawData_svm_cross_validation_result.std()]],
columns = ['Model', 'PCA', 'Accuracy', 'Precision', 'Recall',
'F1-Score', 'CV Mean', 'CV Std Deviation'])
models_results = rawData_svm_model_results
models_results
We get an accuracy of about 97.6%, which is an exceptional result. Now we will perform PCA to reduce the number of dimensions and see whether we can get a better result.
First step is creating a covariance matrix. Since we have 18 independent variables, the covariance matrix will be 18x18.
covMatrix = np.cov(X_scaled, rowvar=False)
print('Covariance Matrix Shape:', covMatrix.shape)
print('Covariance Matrix:\n', covMatrix)
eig_vals, eig_vecs = np.linalg.eig(covMatrix)
tot = sum(eig_vals)
var_exp = [( i /tot ) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
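As a sanity check on this kind of decomposition, the sorted eigenvalues of the sample covariance matrix should match what scikit-learn's PCA reports as `explained_variance_`. Below is a minimal sketch on synthetic data. The array `demo` is made up, and `np.linalg.eigh` is swapped in for `np.linalg.eig` because it is designed for symmetric matrices and always returns real eigenvalues.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (made up for illustration).
rng = np.random.default_rng(7)
base = rng.normal(size=(200, 2))
demo = np.column_stack([base[:, 0],
                        base[:, 0] + 0.1 * base[:, 1],
                        base[:, 1],
                        rng.normal(size=200)])

# Manual route: eigen-decomposition of the sample covariance matrix.
cov = np.cov(demo, rowvar=False)
eig_vals_demo = np.linalg.eigh(cov)[0]          # eigenvalues of a symmetric matrix
eig_vals_demo = np.sort(eig_vals_demo)[::-1]    # largest first

# Library route: PCA centres the data and reports the same variances.
pca_demo = PCA(n_components=demo.shape[1]).fit(demo)

print(np.allclose(eig_vals_demo, pca_demo.explained_variance_))
print(eig_vals_demo / eig_vals_demo.sum())
```

The ratios in the last line correspond to `explained_variance_ratio_`, which is what `var_exp` holds in percent.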
We will now find the Eigen Vectors and Eigen Values:
pca = PCA(n_components=18)
pca.fit(X_scaled)
Eigen Values:
print(pca.explained_variance_)
Eigen Vectors:
print(pca.components_)
Percentage of variation explained by each eigen Vector:
print(pca.explained_variance_ratio_)
plt.bar(list(range(1,19)),pca.explained_variance_ratio_,alpha=1, align='center')
plt.ylabel('Variation explained')
plt.xlabel('Eigen Value/Component')
plt.show()
plt.plot(var_exp)
Observation: We can see a sharp elbow around 2.5 on the Eigen Value/Component axis, followed by a gentler decline between 2.5 and 7.5. Beyond 7.5 the curve is almost flat, which means that adding more components contributes very little additional explained variance. We can confirm this with a cumulative explained variance plot, as done below.
plt.step(list(range(1,19)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cumulative variation explained')
plt.xlabel('Eigen/Components Value')
plt.show()
# Plotting
plt.bar(range(1, eig_vals.size + 1), var_exp, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, eig_vals.size + 1), cum_var_exp, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()
Observation: This plot tells us that by selecting 8 components we can explain approximately 95% of the total variance of the data.
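The component count can also be read off numerically rather than from the plot. The sketch below is an illustration on synthetic data (the array `demo` and the 0.95 target are assumptions). It finds the first count whose cumulative ratio reaches the target and compares it with scikit-learn's own choice when `n_components` is given as a fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with unequal column variances (made up for illustration).
rng = np.random.default_rng(7)
demo = rng.normal(size=(300, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

full = PCA().fit(demo)
cum_ratio = np.cumsum(full.explained_variance_ratio_)

# First component count whose cumulative ratio reaches the target.
target = 0.95
k = int(np.argmax(cum_ratio >= target) + 1)

# Passing a float in (0, 1) asks scikit-learn to make the same choice itself.
auto = PCA(n_components=target, svd_solver='full').fit(demo)
print(k, auto.n_components_)
```

On the vehicle data the same computation on `pca.explained_variance_ratio_` would be expected to agree with the value read from the plot.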
PCA with the 8 best components:
pca_eight_components = PCA(n_components=8)
pca_eight_components.fit(X_scaled)
Transforming our dataframe, which has 18 dimensions, to 8 dimensions:
X_scaled_pca_eight_attr = pca_eight_components.transform(X_scaled)
X_scaled_pca_eight_attr.shape
Converting PCA Transformed data from Array to Dataframe to visualise in a pairplot:
pca_datafram=pd.DataFrame(X_scaled_pca_eight_attr)
sns.pairplot(pca_datafram, diag_kind = 'kde')
Observation: The pairplot shows no visible relationship between any pair of principal components, which is expected because PCA produces uncorrelated components. The purpose of our PCA is achieved: the dimensions are reduced from 18 to 8. Now we can use this data to train the SVM model and check for the difference. To do so, we start by splitting the PCA-transformed data into train and test sets as below.
#PCA transformed data splitting
pca_X_train,pca_X_test,pca_y_train,pca_y_test = train_test_split(X_scaled_pca_eight_attr, y, test_size=0.30, random_state=7)
print('-----------------------Original Data----------------------------- \n')
print('x train data {}'.format(X_train.shape))
print('y train data {}'.format(y_train.shape))
print('x test data {}'.format(X_test.shape))
print('y test data {}'.format(y_test.shape))
print('\n\n-----------------------PCA Transformed Data-----------------------------\n')
print('x train data {}'.format(pca_X_train.shape))
print('y train data {}'.format(pca_y_train.shape))
print('x test data {}'.format(pca_X_test.shape))
print('y test data {}'.format(pca_y_test.shape))
#Using GridSearch Cross-Validation to identify the best Hyperparameters
svm_model = SVC(gamma="scale")
params = {'kernel': ['linear', 'rbf'], 'C':[0.01, 0.1, 0.5, 1]}
gridSearch_model = GridSearchCV(svm_model, param_grid=params, cv=5)
# Fit the search on the training split only, as for the raw data, so the test set stays unseen
gridSearch_model.fit(pca_X_train, pca_y_train)
print("Best Hyper Parameters found by GridSearch:\n", gridSearch_model.best_params_)
pca_svm = SVC(C = 1, kernel = 'rbf', gamma= "auto")
pca_svm.fit(pca_X_train, pca_y_train)
# Predicting for test set
pca_svm_y_predicted = pca_svm.predict(pca_X_test)
pca_svm_Score = pca_svm.score(pca_X_test, pca_y_test)
pca_svm_Accuracy = accuracy_score(pca_y_test, pca_svm_y_predicted)
pca_svm_classification_Report = metrics.classification_report(pca_y_test, pca_svm_y_predicted, target_names=target_names)
pca_svm_classification_Report_dict = metrics.classification_report(pca_y_test, pca_svm_y_predicted, target_names=target_names, output_dict=True)
pca_svm_precision_weightedAvg = pca_svm_classification_Report_dict['weighted avg']['precision']
pca_svm_recall_weightedAvg = pca_svm_classification_Report_dict['weighted avg']['recall']
pca_svm_f1_score_weightedAvg = pca_svm_classification_Report_dict['weighted avg']['f1-score']
pca_svm_confusion_matrix = metrics.confusion_matrix(pca_y_test, pca_svm_y_predicted)
print('\nPCA SVM: \n', pca_svm_confusion_matrix)
print('\nPCA SVM classification Report : \n', pca_svm_classification_Report)
# Cross Validation
pca_svm_cross_validation_result = model_selection.cross_val_score(pca_svm, X_scaled_pca_eight_attr, y, cv=kfold, scoring='accuracy')
cv_results.append(pca_svm_cross_validation_result)
model_names.append('PCA-SVM')
pca_svm_model_results = pd.DataFrame([['PCA SVM (RBF)', 'Yes', pca_svm_Accuracy, pca_svm_precision_weightedAvg, pca_svm_recall_weightedAvg,
pca_svm_f1_score_weightedAvg,pca_svm_cross_validation_result.mean(),
pca_svm_cross_validation_result.std()]],
columns = ['Model', 'PCA', 'Accuracy', 'Precision', 'Recall',
'F1-Score', 'CV Mean', 'CV Std Deviation'])
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
models_results = pd.concat([models_results, pca_svm_model_results], ignore_index=True)
models_results
# Only two heatmaps are drawn, so a 1x2 grid avoids empty axes
fig, axs = plt.subplots(nrows = 1, ncols = 2, figsize = (16,6))
class_labels = ["Car", "Bus", "Van"]
# Confusion matrix for SVM with PCA transformed data and SVM with Raw data
cm = metrics.confusion_matrix(pca_y_test, pca_svm_y_predicted, labels=[0, 1, 2])
df_cm = pd.DataFrame(cm, index = class_labels, columns = class_labels)
sns.heatmap(df_cm, annot=True , annot_kws={'size': 15}, fmt='g', ax = axs[0])
axs[0].tick_params(labelsize=20)
axs[0].set_xlabel('Predicted Labels', fontsize=20)
axs[0].set_ylabel('Actual Labels', fontsize=20)
axs[0].set_title('PCA SVM(rbf)', fontsize=20)
cm = metrics.confusion_matrix(y_test, rawData_svm_y_predicted, labels=[0, 1, 2])
df_cm = pd.DataFrame(cm, index = class_labels, columns = class_labels)
sns.heatmap(df_cm, annot=True , annot_kws={'size': 15}, fmt='g', ax = axs[1])
axs[1].tick_params(labelsize=20)
axs[1].set_xlabel('Predicted Labels', fontsize=20)
axs[1].set_ylabel('Actual Labels', fontsize=20)
axs[1].set_title('SVM(rbf)', fontsize=20)
plt.subplots_adjust(hspace=0.4, wspace=0.4)
plt.show()
models_results
The test set has 254 samples in total: 127 cars, 66 buses and 61 vans. The per-class accuracies below are the share of each actual class that was predicted correctly (i.e. the recall of that class).
RawData SVM
We got an accuracy of 97.6% for Car, i.e. (124/127 × 100)
We got an accuracy of 98.5% for Bus, i.e. (65/66 × 100)
We got an accuracy of 96.7% for Van, i.e. (59/61 × 100)
PCA SVM
We got an accuracy of 96.9% for Car, i.e. (123/127 × 100)
We got an accuracy of 98.5% for Bus, i.e. (65/66 × 100)
We got an accuracy of 91.8% for Van, i.e. (56/61 × 100)
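The per-class figures and the overall accuracies can be reproduced from the counts quoted above. This is a small arithmetic sketch; the variable names are hypothetical and the numbers are taken from the text, not recomputed from the models.

```python
import numpy as np

classes = ['Car', 'Bus', 'Van']
actual_totals = np.array([127, 66, 61])   # test samples per class
raw_correct = np.array([124, 65, 59])     # correctly classified, raw-data SVM
pca_correct = np.array([123, 65, 56])     # correctly classified, PCA SVM

# Per-class share of correct predictions (recall of each class).
raw_recall = raw_correct / actual_totals
pca_recall = pca_correct / actual_totals

# Overall accuracy: all correct predictions over all test samples.
raw_accuracy = raw_correct.sum() / actual_totals.sum()
pca_accuracy = pca_correct.sum() / actual_totals.sum()

for name, r, p in zip(classes, raw_recall, pca_recall):
    print(f'{name}: raw {r:.1%}, PCA {p:.1%}')
print(f'Overall: raw {raw_accuracy:.1%}, PCA {pca_accuracy:.1%}, '
      f'difference {raw_accuracy - pca_accuracy:.1%}')
```

With a full confusion matrix `cm`, the same per-class values come from `cm.diagonal() / cm.sum(axis=1)`.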
The difference in test accuracy between raw data and PCA data is only about 1.5 percentage points, and the difference in CV Mean is merely 0.5 percentage points.
PCA also makes the data abstract: each component is a combination of the original attributes, so the findings are harder to interpret.
Hence we can conclude that the two cases are very close, but SVM on the raw data performs slightly better here than SVM on the PCA data.